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Abstract 

Discovering the 3D atomic structure of molecules such 
as proteins and viruses is a fundamental research problem 
in biology and medicine. Electron Cryomicroscopy (Cryo- 
EM) is a promising vision-based technique for structure es¬ 
timation which attempts to reconstruct 3D structures from 
2D images. This paper addresses the challenging prob¬ 
lem of 3D reconstruction from 2D Cryo-EM images. A 
new framework for estimation is introduced which relies 
on modern stochastic optimization techniques to scale to 
large datasets. We also introduce a novel technique which 
reduces the cost of evaluating the objective function dur¬ 
ing optimization by over five orders or magnitude. The net 
result is an approach capable of estimating 3D molecular 
structure from large scale datasets in about a day on a sin¬ 
gle workstation. 

1. Introduction 

Discovering the 3D atomic structure of molecules such 
as proteins and viruses is a fundamental research problem 
in biology and medicine. The ability to routinely determine 
the 3D structure of such molecules would potentially revo¬ 
lutionize the process of drug development and accelerate re¬ 
search into fundamental biological processes. Electron Cry¬ 
omicroscopy (Cryo-EM) is an emerging vision-based ap¬ 
proach to 3D macromolecular structure determination that 
is applicable to medium to large-sized molecules in their na¬ 
tive state. This is in contrast to X-ray crystallography which 
requires a crystal of the target molecule, which are often im¬ 
possible to grow [32] or nuclear magnetic resonance (NMR) 
spectroscopy which is limited to relatively small molecules 
[15]. 

The Cryo-EM reconstruction task is to estimate the 3D 
density of a target molecule from a large set of images of the 
molecule (called particle images). The problem is similar in 
spirit to multi-view scene carving [6, 16] and to large-scale, 
uncalibrated multi-view reconstruction [1]. Like multi-view 
scene carving, the goal is to estimate a dense 3D occupancy 
representation of shape from a set of different views, but 


Eigure 1: The goal is to reconstruct the 3D structure of a 
molecule (right), at nanometer scales, from a large number 
of noisy, uncalibrated 2D projections obtained from cryo- 
genically frozen samples in an electron microscope (left). 

unlike many approaches to scene carving, we do not as¬ 
sume calibrated cameras, since we do not know the relative 
3D poses of the molecule in different images. Like uncali¬ 
brated, multi-view reconstruction, we aim to use very large 
numbers of uncalibrated views to obtain high fidelity 3D 
reconstructions, but the low signal-to-noise levels in Cryo- 
EM (often as low as 0.05 [4]; see Eig. 1) and the different 
image formation model prevent the use of common feature 
matching techniques to establish correspondences. Com¬ 
puted Tomography (CT) [13, 10] uses a similar imaging 
model (orthographic integral projection), however in CT the 
projection direction of each image is known whereas with 
Cryo-EM the relative pose of each particle is unknown. 

Existing Cryo-EM techniques, e.g., [7, 11, 34, 37], suf¬ 
fer from two key problems. Eirst, without good initializa¬ 
tion, they converge to poor or incorrect solutions [12], often 
with little indication that something went wrong. Second, 
they are extremely slow, which limits the number of parti¬ 
cles images one can use as input to mitigate the effects of 
noise; e.g., the website of the RELION package [34] reports 
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requiring two weeks on 300 cores to process a dataset with 
200,000 images. 

We introduce a framework for Cryo-EM density estima¬ 
tion, formulating the problem as one of stochastic optimiza¬ 
tion to perform maximum-a-posteriori (MAP) estimation in 
a probabilistic model. The approach is remarkably efficient, 
providing useful low resolution density estimates in an hour. 
We also show that our stochastic optimization technique is 
insensitive to initialization, allowing the use of random ini¬ 
tializations. We further introduce a novel importance sam¬ 
pling scheme that dramatically reduces the computational 
costs associated with high resolution reconstruction. This 
leads to speedups of 100,000-fold or more, allowing struc¬ 
tures to be determined in a day on a modern workstation. In 
addition, the proposed framework is flexible, allowing parts 
of the model to be changed and improved without impacting 
the estimation; e.g., we compare the use of three different 
priors. To demonstrate our method, we perform reconstruc¬ 
tions on two real datasets and one synthetic dataset. 

2. Background and Related Work 

In Cryo-EM, a purified solution of the target molecule 
is cryogenically frozen into a thin (single molecule thick) 
film, and imaged with a transmission electron microscope. 
A large number of such samples are obtained, each of which 
provides a micrograph containing hundreds of visible, in¬ 
dividual molecules. In a process known as particle pick¬ 
ing, individual molecules are selected, resulting in a stack 
of cropped particle images. Particle picking is often done 
manually, however there have been recent moves to partially 
or fully automate the process [17, 40]. Each particle image 
provides a noisy view of the molecule, but with unknown 
3D pose, see Eig. 2 (right). The reconstruction task is to es¬ 
timate the 3D electron density of the target molecule from 
the potentially large set of particle images. 

Common approaches to Cryo-EM density estimation, 
e.g., [7, 11, 37], use a form of iterative refinement. Based 
on an initial estimate of the 3D density, they determine the 
best matching pose for each particle image. A new density 
estimate is then constructed using the Eourier Slice The¬ 
orem (EST); i.e., the 2D Eourier transform of an integral 
projection of the density corresponds to a slice through the 
origin of the 3D Eourier transform of that density, in a plane 
perpendicular to the projection direction [13]. Using the 
3D pose for each particle image, the new density is found 
through interpolation and averaging of the observed particle 
images. 

This approach is fundamentally limited in several ways. 
Even if one begins with the correct 3D density, the low SNR 
of particle images makes accurately identifying the correct 
pose for each particle nearly impossible. This problem is 
exacerbated when the initial density is inaccurate. Poor 
initializations result in estimated structures that are either 


clearly wrong (see Eig. 9) or, worse, appear plausible but 
are misleading in reality, resulting in incorrectly estimated 
3D structures [12]. Einally, and crucially for the case of 
density estimation with many particle images, all data are 
used at each refinement iteration, causing these methods to 
be extremely slow. Mallick et al. [25] proposed an approach 
which attempted to establish weak constraints on the rela¬ 
tive 3D poses between different particle images. This was 
used to initialize an iterative refinement algorithm to pro¬ 
duce a final reconstruction. In contrast, our refinement ap¬ 
proach does not require an accurate initialization. 

To avoid the need to estimate a single 3D pose for each 
particle image, Bayesian approaches have been proposed in 
which the 3D poses for the particle images are treated as la¬ 
tent variables, and then marginalized out numerically. This 
approach was originally proposed by Sigworth [35] for 2D 
image alignment and later by Scheres et al. [33] for 3D es¬ 
timation and classification. It was since been used by Jaitly 
et al. [14], where batch, gradient-based optimization was 
performed. Nevertheless, due to the computational cost of 
marginalization, the method was only applied to small num¬ 
bers of class-average images which are produced by cluster¬ 
ing, aligning and averaging individual particle images ac¬ 
cording to their appearance, to reduce noise and the number 
of particle images used during the optimization. More re¬ 
cently, pose marginalization was applied directly with par¬ 
ticle images, using a batch Expectation-Maximization algo¬ 
rithm in the RELION package [34] . However, this approach 
is extremely computationally expensive. Here, the proposed 
approach uses a similar marginalized likelihood, however 
unlike previous methods, stochastic rather than batch op¬ 
timization is used. We show that this allows for efficient 
optimization, and for robustness with respect to initializa¬ 
tion. We further introduce a novel importance sampling 
technique that dramatically reduces the computational cost 
of the marginalization when working at high resolutions. 

3. A Framework for 3D Density Estimation 

Here we present our framework for density estimation 
which includes a probabilistic generative model of image 
formation, stochastic optimization to cope with large-scale 
datasets, and importance sampling to efficiently marginalize 
over the unknown 3D pose of the particle in each image. 

3.1. Image Formation Model 

In Cryo-EM, particle images are formed as orthographic, 
integral projections of the electron density of a molecule, 
V G . In each image, the density is oriented in an un¬ 
known pose, R G iSO(3), relative to the direction of the 
microscope beam. The projection along this unknown di¬ 
rection is a linear operator, which is represented by the ma¬ 
trix Pr G . Along with pose, the in-plane transla¬ 

tion t G of each particle image is unknown, the effect of 
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Figure 2: A generative image formation model in Cryo-EM. The electron beam results in an orthographic integral projection 
of the electron density of the specimen. This projection is modulated by the Contrast Transfer Function (CTF) and corrupted 
with noise. The images pictured here showcase the low SNR typical in Cryo-EM. The zeros in the CTF (which completely 
destroy some spatial information) make estimation particularly challenging, however their locations vary as a function of 
microscope parameters. These are set differently across particle images in order to mitigate this problem. Particle images 
and density from [18]. 


which is similarly represented by a matrix St G 
The resulting shifted projection is corrupted by two phe¬ 
nomena: a contrast transfer function (CTF) and noise. The 
CTF is analogous to the effects of defocus in a conventional 
light camera and can be modelled as a convolution of the 
projected image. This linear operation is represented here 
by the matrix G where 0 are the parameters of 

the CTF model [30]. The Fourier spectrum of a typical CTF 
is shown in Figure 2; note the phase changes which result 
in zero crossings (not typically observed in traditional light 
cameras) and the attenuation at higher frequencies which 
makes estimation particularly challenging. CTF parame¬ 
ters, 0, are assumed to be given; CTF estimation is beyond 
the scope of this work, but is routinely done using existing 
tools, e.g., [24, 27]. 

As noted above, and clearly seen in Figure 2, there is 
a large amount of noise present in typical particle images. 
This is primarily due to the sensitive nature of biological 
specimens, requiring extremely low exposures. The noise 
is modelled using an IID Gaussian distribution, resulting in 
the following expression for the conditional distribution of 
a particle image, X G , 

p{I I e, R, t, V) = Mil I CeStPRV, a^l) (1) 

where a is the standard deviation of the noise and A/’(' |/i, H) 
is the multivariate normal distribution with mean vector fi 
and covariance matrix S. 

In practice, due to computational considerations. Equa¬ 
tion (1) is evaluated in Fourier space, making use of the 


Fourier Slice Theorem and ParsevaTs Theorem to obtain 

Pix I e, R, t, V) = Mil I C^StPRV, a^I) (2) 

where X is the 2D Fourier transform of the image, St is 
the shift operator in Fourier space (a phase change), 
is the CTF modulation in Fourier space (a diagonal oper¬ 
ator), Pr is a sine interpolation operator which extracts a 
plane through the origin defined by the projection orienta¬ 
tion R and V is the 3D Fourier transform of V. To speed the 
computation of the likelihood, and due to the level of noise 
and attenuation of high frequencies by the CTF, a maximum 
frequency is specified, cc, beyond which frequencies are ig¬ 
nored. 

The 3D pose, R, and shift, t, of each particle image are 
unknown and treated as latent variables which are marginal¬ 
ized out [35, 33]. Assuming R and t are independent of 
each other and the density V, one obtains 

p{X\0,V)= [ [ p(X|(9,R,t,V)p(R)p(t)dRdt 
JsO(3) 

(3) 

where p(R) is a prior over 3D poses, R G SO{3), and p{t) 
is a prior over translations, t G In general, nothing is 
known about the projection direction so p(R) is assumed to 
be a uniform distribution. Particles are picked to be close 
to the center of each image, so p(t) is chosen to be a Gaus¬ 
sian distribution centered in the image. The above double 
integral is not analytically tractable, so numerical quadra¬ 
ture is used [22, 9]. The conditional probability of an image 











(likelihood) then becomes 

Mr Mt 

p(X|0,V)« E<E wIp{X\0, R^-, V)p(R)p(t) 

3=1 £=1 

(4) 

where {(Rj, are weighted quadrature points over 

50(3) and {(t^,u;*)}^\ are weighted quadrature points 
over . The accuracy of the quadrature scheme, and con¬ 
sequently the values of Mr and Mt, are set automatically 
based on cc, the specihed maximum frequency such that 
higher values of cj results in more quadrature points. 

Given a set of K images with CTF parameters D = 
{ (Zi , and assuming conditional independence of the 

images, the posterior probability of a density V is 

K 

p{v\^)^p{v)l[p{ii\ei,v) (5) 

i=l 

where p{V) is a prior over 3D molecular electron densities. 
Several choices of prior are explored below, but we found 
that a simple independent exponential prior worked well. 

Specihcally, p{V) = nS=i where Vi is the value 

of the ith voxel and A is the inverse scale parameter. Other 
choices of prior are possible and is a promising direction for 
future research. 

Estimating the density now corresponds to Ending V 
which maximizes Equation (5). Taking the negative log 
and dropping constant factors, the optimization problem be¬ 
comes argmin^^j^RS /(V), 

K 

f{v) = -iogp{v) -Y,^ogp{ii\ei,v) (6) 

i=l 

where V is restricted to be positive (negative density is phys¬ 
ically unrealistic). Optimizing Eq. (6) directly is costly due 
to the marginalization in Eq. (4) as well as the large num¬ 
ber (K) of particle images in a typical dataset. To deal with 
these challenges, the following sections propose the use of 
two techniques, namely, stochastic optimization and impor¬ 
tance sampling. 

3.2. Stochastic Optimization 

In order to efficiently cope with the large number of 
particle images in a typical dataset, we propose the use 
of stochastic optimization methods. Stochastic optimiza¬ 
tion methods exploit the large amount of redundancy in 
most datasets by only considering subsets of data (i.e., im¬ 
ages) at each iteration by rewriting the objective as /(V) = 
X]/c//c(V) where each fk{V) evaluates a subset of data. 
This allows for fast progress to be made before a batch op¬ 
timization algorithm would be able to take a single step. 

There are a wide range of such methods, ranging from 
simple stochastic gradient descent with momentum [28, 29, 


36] to more complex methods such as Natural Gradient 
methods [2, 3, 19, 20] and Hessian-free optimization [26]. 
Here we propose the use of Stochastic Average Gradient 
Descent (SAGD) [2 1 ] which has several important advan¬ 
tages. First, it is effectively self-tuning, using a line-search 
to determine and adapt the learning rate. This is particu¬ 
larly important, as many methods require signihcant man¬ 
ual tuning for new objective functions and, potentially, each 
new dataset. Further, it is specihcally designed for the hnite 
dataset case allowing for faster convergence. 

At each iteration r, SAGD [21] considers only a single 
subset of data, which dehnes part of the objective func¬ 
tion fk^ (V) and its gradient (V). The density V is then 
updated as 

K 

i=i 

where e is a base learning rate, L is a Lipschitz constant of 
g/e(V), and 


dVl 


dVl~^ Otherwise 


( 8 ) 


is the most recent gradient evaluation of datapoint j at it¬ 
eration r. This step can be computed efficiently by stor¬ 
ing the gradient of each observation and updating a run¬ 
ning sum each time a new gradient is seen. The Lipschitz 
constant L is not generally known but can be estimated us¬ 
ing a line-search technique. Theoretically, convergence oc¬ 
curs for values of e < ^ [21], however in practice larger 
values at early iterations can be benehcial, thus we use 
e = max(^, ). To allow parallelization and re¬ 

duce the memory requires of SAGD, the data is divided into 
minibatches of 200 particles images. Finally, to enforce the 
positivity of density, negative values of V are truncated to 
zero after each iteration. More details of the stochastic op¬ 
timization can be found in the Supplemental Material. 

3.3. Importance Sampling 

While stochastic optimization allows us to scale to large 
datasets, the cost of computing the required gradient for 
each image remains high due to the marginalization over 
orientations and shifts. Intuitively, one could consider ran¬ 
domly selecting a subset of the terms in Eq. (4) and using 
this as an approximation. This idea is formalized by impor¬ 
tance sampling (IS) which allows for an efficient and accu¬ 
rate approximation of the discrete sums in Eq. (4).^ A full 
review of importance sampling is beyond the scope of this 
paper but we refer readers to [38]. 

^One can also apply importance sampling directly to the continuous 
integrals in Eq. (3) but it can be computationally advantageous to precom¬ 
pute a fixed set of projection and shift matrices, Pr and St, which can be 
reused across particle images. 



To apply importance sampling, consider the inner sum 
from Eq. (4), rewriting it as 


Mt 


Mt 


= = 


i=l 


i=l 


\ ) 


(9) 


where pj^e = p{'i\0,'Rj,te,V)p{'Rj)p{A and Q* = 
(^15 • • • 5 Qm^ is the parameter vector of a multinomial im¬ 
portance distribution such that = ^ > 0- 

The domain of q* corresponds to the set of quadrature 
points in Equation (4). Then, can be thought of as the 

t 

expected value where ^ is a random variable dis¬ 

tributed according to q*. If a set of Nt Mt random 
indexes 3^ are drawn according to q*, then 


pf 


— E 

Nt ^ 


ie3* 


WePj,e 

<1} 


( 10 ) 


Thus, we can efficiently approximate by drawing sam¬ 
ples according to the importance distribution q^ and com¬ 
puting the average. Using this approximation in Eq. (4) 
gives 


p(i|^,V): 


Mr 




and importance sampling can be similarly used for the outer 
summation to give 


p{i\e,v)^ E 


Nnqf 
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Pj,e 


( 12 ) 


where 3*^ are samples drawn from the importance distribu¬ 
tion = (q'^, ..., '^^®d for approximating 
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wfpj,e 




(13) 
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je3^ 


The accuracy of the approximation in Eq. (12) is controlled 
by the number of samples used, with the error going to 
zero as N increases. We use N = so5(q) samples where 
Qi) is the effective sample size [8] and sq is 
a scaling factor. This choice ensures that when the impor¬ 
tance distribution is diffuse, more samples are used. 

While the estimates provided by IS are unbiased, their 
error can be arbitrarily bad if the importance distribution is 
not well chosen. To choose a suitable importance distribu¬ 
tion, we make two observations. Eirst, the values and (j)^ 
are proportional to the marginal probability of single parti¬ 
cle image having been generated with shift or pose Rj, 
making them natural choices on which to base the impor¬ 
tance distributions. Second, these values remain stable once 



Eigure 3: The KL divergence between the values of at 
the current and previous epochs on the thermus dataset. 
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Eigure 4: Previously published structures for the datasets 
used in this paper. 


the rough shape of the structure has been determined. This 
can be seen in Eigure 3 which shows that by the third epoch 
the KL divergence of the values of from one epoch to 
the next is extremely small. 

Thus we use these quantities, computed from the previ¬ 
ous iterations, to construct the importance distributions at 
the current iteration. Dropping the R or t superscripts for 
clarity, let 3 be the set of samples evaluated at the previous 
iteration and pi be the computed values for i G 3. Then the 
importance distribution used at the current iteration is 

qj = (1 - a)Z~^pj + apj (14) 

where pj is a uniform prior distribution, a controls 
how much the previous distribution is relied on, Z = 
(f>j = Here T is an 

annealing parameter and Kij are entries of a kernel 
matrix computed on the quadrature points which dif¬ 
fuses probability to nearby quadrature points. The val¬ 
ues for a = max(0.05, ) and T = 

max(1.25, ) are set so that at early itera¬ 

tions, when the underlying density is changing, we rely 
more heavily on the prior. Eor K we use a Gaussian kernel 
for the shifts and a Eisher kernel for the orientations. The 
bandwidth of the kernel is tuned based on the current reso¬ 
lution of the quadrature scheme, e.g., the Gaussian shift ker¬ 
nel bandwidth is set to be equal to the spacing between the 
shift quadrature points. More details on importance sam¬ 
pling can be found in the Supplemental Material. 
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Figure 5: The random initialization (left) used in all ex¬ 
periments, generated by summing random spheres, and re¬ 
construction of the thermus dataset after various amounts of 
computation. Note that within an hour of computation, the 
gross structure is already well determined, after which fine 
details emerge gradually. 

4. Experiments 

The proposed method was applied to two experimental 
datasets and one synthetic dataset. All experiments used the 
same parameters and were initialized using the same ran¬ 
domly generated density shown in Figure 5 (left). The maxi¬ 
mum frequency considered was gradually increased from an 
initial value of cc = 1/40A to a maximum of cc = i/ioA. 
This maximum frequency corresponds to the resolution of 
the best published results for the datasets used here, i.e., 
[18]. Optimizations were run until the maximum resolution 
was reached and the average error on a held-out set of 100 
particle images stopped improving, around 5000 iterations. 

Datasets The first dataset was ATP synthase from the 
thermus thermophilus bacteria, a large transmembrane 
molecule. The thermus dataset consisted of 46,105 par¬ 
ticle images which were provided by Lau and Rubinstein 
[18]. The high resolution structure from [18] and some 
sample images are shown in Figure 2. The second dataset 
was bovine mitochondrial ATP synthase [31]. The bovine 
dataset, provided by Rubinstein et al. [31], consisted of 
5,984 particle images. In all cases the particle images pro¬ 
vided were 128 x 128, had a resolution of 2.8A (0.28nm 
per pixel) and CTF information for each particle image was 
provided. The noise level, a, was estimated by computing 
the standard deviation of pixels around the boundary of the 
particle images. 

To showcase the ability of our method to handle a dra¬ 
matically different type of structure, a third dataset was 
synthesized by taking an existing structure from the Pro¬ 
tein Data Bank^, GroEL-GroES-(ADP)7 [39], and gener¬ 
ating 40, 000 random projections according to the genera¬ 
tive model. GTE, signal-to-noise level and other parameters 
were set realistically based on the thermus dataset values. 

^Structure 1 AON from http : / /pdb . org 
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Eigure 6: Relative error (blue, left axis) and fraction of 
total quadrature points (red, right axis) used in computing 
logp(X|6>,V) as a function of the ESS scaling factor, sq 
( horizontal axis). Note the log-scale of the axes. 

This structure, as well as previously solved structures of the 
bovine and thermus ATP synthase molecules are shown in 
Eigure 4. GroEL-GroES, was selected because it is struc¬ 
turally unlike either of the bovine or thermus ATP synthase 
molecules. Sample GroEL synthetic images can be see in 
Eigure 11 (top left). 

Results of our method on these datasets are shown in Eig¬ 
ure 11. Sample particle images are shown, along with an 
iso-surface and slices of the final estimated density. Com¬ 
puting these reconstructions took less than 24 hours in all 
cases. Eurther, even at early iterations, reasonable structures 
are available. Eigure 5 shows the estimated structure for the 
thermus dataset over time during optimization. Notably, af¬ 
ter just one hour (during which only a fraction of the full 
dataset is seen), the low-resolution shape of the structure 
has already been determined. 

Importance Sampling To validate our importance sam¬ 
pling approach we evaluated the error made in computing 
logp(X|6>, V) using IS against computing the exact sum in 
Equation (4) without IS. This error is plotted in Eigure 6, 
along with the fraction of quadrature points used at various 
values of sq. Based on these plots we selected a factor of 
5o = 10 for all experiments as a trade-off between accuracy 
and speed achieving a relative error of less then 0.1% while 
still providing significant speedups. 

To see just how much of a speedup importance sampling 
provides in practice, we plotted in Eigure 7 the fraction of 
quadrature points which needed to be evaluated during op¬ 
timization. As can be seen initially, all quadrature points 
are evaluated but as optimization progresses and the density 
(and consequently the distribution over poses) becomes bet¬ 
ter determined importance sampling yields larger and larger 
speedups. At the full resolution, importance sampling pro¬ 
vided more than a 100,000 fold speedup. 

No prior knowledge of the orientation distribution was 
assumed. However, for many particles, certain views are 










Figure 7: The fraction of quadrature points evaluated on 
average during optimization (horizontal axis is iterations). 
As resolution increases, the speedup obtained increases sig¬ 
nificantly yielding more than a 100,000 fold speedup. 



Figure 8: A Winkel-Tripel projection of the importance dis¬ 
tribution of view directions, q^, averaged over the thermus 
dataset at a typical iteration. Clearly visible is the equatorial 
belt of likely views, while axis aligned views (those on the 
top or bottom of the plot) are rarely seen. 

more likely than others. This fact can be seen by examining 
the average importance distribution for the thermus dataset, 
shown in Figure 8 for a typical iteration. Here we can see 
clearly that the distribution of views forms an equatorial belt 
around the particle, while top or bottom views are rarely if 
ever seen. This phenomenon is well known for particles 
like these {e.g., see [31] where this knowledge was used di¬ 
rectly in estimation), validating our sampling approach and 
suggesting a use of this average importance distribution to 
supplement the uniform prior distribution in Eq. (14). 

Initialization and Comparison to State-of-the-Art To 

compare this method to existing methods for structure de¬ 
termination, we selected two representative approaches. 
The first is a standard iterative projection matching scheme 
where images are matched to an initial density through a 
global cross-correlation search [11]. The density is then re¬ 
constructed based on these fixed orientations and this pro¬ 
cess is iterated. The second is the RELION package de¬ 
scribed in [34] which uses a similar marginalized model 
as our method but with a batch EM algorithm to perform 
optimization. We used publicly available code for both of 
these approaches on the thermus dataset and initialized us¬ 
ing the density shown in Figure 5. We ran each method for 
a number of iterations roughly equivalent computationally 


Figure 9: Baseline comparisons to two existing standard 
methods. Iterative projection matching and reconstruction 
(left) and RELION [34] (middle). The proposed method 
(right) is able to determine the correct structure while pro¬ 
jection matching and RELION both become trapped in poor 
local optima. See Fig. 9(middle) for comparison. All meth¬ 
ods used the same random initialization shown in Fig. 5. 
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Figure 10: Slices through the reconstructions with (from 
left to right) uniform, CAR and exponential priors. The ex¬ 
ponential prior does the best job of suppressing noise in the 
background without oversmoothing fine details within the 
structure. Blue corresponds to small or zero density and red 
corresponds to high density. 

to the 5000 iterations used by our method and the results are 
shown in Figure 9. In both cases the approaches had clearly 
determined an incorrect structure and appeared to have con¬ 
verged to a local minimum as no further progress was made. 
Both projection matching and RELION have been used suc¬ 
cessfully for reconstruction by others and are not recom¬ 
mended to be used without a good initialization. Our results 
support this recommendation as neither approach converges 
from random initializations. In practice, it is difficult to con¬ 
struct good initializations for molecules of unknown shape 
[12], giving our proposed method a significant advantage. 

Comparing Priors The above results used an exponen¬ 
tial prior for the density at the voxels of Vi, however the 
presented framework allows for any continuous and dif¬ 
ferentiable prior to be used. To demonstrate this, we 
explored two other priors: an improper (uniform) prior, 
p(Vi) (X 1, and a conditionally autoregressive (CAR) prior 
[5]p(Vi|V_i) = which 

is a smoothness prior biasing each voxel towards the mean 
of its 26 immediate neighbours Nbhd(i). Slices through 
the resulting densities on thermus under these priors are 
shown in Figure 10. With an improper uniform prior (Fig. 























Figure 11: Sample particle images (left), an isosurface of the reconstructed 3D density (middle) and slices through the 3D 
density with colour indicating relative density (right) for GroEL-GroES (top), thermus thermophilus ATPase (middle) and 
bovine mitochondrial ATPase (bottom). The relative root expected mean squared error (RREMSE) on a held-out test set 
was 0.99, 0.96 and 0.98 with values relative to the estimated noise level. See Supplemental Material for more on the error 
measure. Reconstructions took a day or less on a 16 core workstation. 


10,left), there is significant noise visible in the background. 
This noise is somewhat suppressed with the CAR prior (Pig. 
10,middle) however the best results are clearly obtained us¬ 
ing the exponential prior which suppresses the background 
noise without smoothing out internal details. 

5. Conclusions 

This paper introduces a framework for efficient 3D 
molecular reconstruction from Cryo-EM images. It com¬ 
prises MAP estimation of 3D structure with a genera¬ 
tive model, marginalization over 3D particle poses, and 
optimization using SAGD. A novel importance sampling 
scheme was used to reduce the computational cost of 
marginalization. The resulting approach can be applied to 
large stacks of Cryo-EM images, providing high resolution 
reconstructions in a day on a 16-core workstation. 

The problem of density estimation for Cryo-EM is a fas¬ 
cinating vision problem. The low SNR in particle images 
makes it remarkable that any molecular structure can be es¬ 


timated, let alone the high resolution densities which are 
now common. Recent research [23] suggests that the com¬ 
bination of new techniques and new sensors may facilitate 
atomic resolution reconstructions for arbitrary molecules. 
This development will be ground-breaking in both biologi¬ 
cal and medical research. 

Beyond the work described in this paper, there remain a 
number of unresolved questions for future research. While 
an exponential prior was found to be effective, more sophis¬ 
ticated priors could be learned, potentially enabling higher 
resolution estimation without the need to collect more data 
and providing a kind of of atomic-scale super-resolution. 
The optimization problem is challenging, and, while SAGD 
was successful here, it is likely that more efficient stochastic 
optimization methods are possible by exploiting the prob¬ 
lem structure to a greater degree. In order to encourage oth¬ 
ers to work on this problem, source code will be available 
from the authors’ website. 
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the sum in the above update equation is not computed at 
each iteration, but rather a running total is maintained and 
updated as follows: 
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The SAGD algorithm requires a Lipschitz constant L 
which is not generally know. Instead it is estimated us¬ 
ing a line search algorithm where an initial value of L is 
increased until the instantiated Lipschitz condition /(V) — 
/(V — L~^dV) < is met. The line search for the 

Lipschitz constant L is only performed once every 20 iter¬ 
ations. Note that a more sophisticated line search could be 
performed if desired. A good initial value of L is found us¬ 
ing a bisection search where the upper bound is the smallest 
L found so far to satisfy the condition and the lower bound 
is the largest L found so far which fails the condition. In 
between line searches, L is gradually decreased to try to 
take larger steps. The entire SAGD algorithm is provided in 
Algorithm (1). 

B. Importance Sampling 

Importance Sampling is a key part of the proposed recon¬ 
struction method for Cryo-EM and provides large speedups 







Algorithm 1 SAGD 
Initialize V and L 
Initialize g ^ 0 

Initialize dVk ^ 0 for all /c = 1..K 

for T = l . Tmax do 

Select data subset kr 
Compute objective gradient (V) 
g <- g - dVk^ + gfc^ (V) 
dVk^^SkAV) 

V^V-f [g-^logp(V)] 
if mod(T,20) == 0 then 
Perform line search 

while /fc,(V) - fkAV - L-^dVk,) < do 

L^2L 

end while 

else 

2 150 

end if 
end for 


during optimization. We use importance sampling to effi¬ 
ciently compute the discrete sum in Equation (4). Note that 
importance sampling is applied independently for each im¬ 
age in the dataset, since the orientations and shifts which 
correspond to important terms in the discrete sum can be 
different for each image. 

In practice, we split the outer sum in Equation (4) into a 
double summation, one over orientations on the sphere and 
one over in-plane rotations of images and projections. We 
then compute each of the three sums (over shift, in-plane 
rotation, and orientation) with and independent importance 
sampler. This is equivalent to computing the full sum in 
Equation (4) using a single importance sampler with an im¬ 
portance distribution that is factored into three parts, one 
for each of shift, in-plane rotation, and orientation. This 
factoring is necessary, as the memory requirements of stor¬ 
ing a fully joint importance distribution for each image in 
the dataset would become infeasible for high-resolution re¬ 
constructions. 

Eor each of the three importance samplers, the impor¬ 
tance distribution at each iteration is constructed accord¬ 
ing to Equation (14). At the first iteration during which a 
particular image is seen, the importance distribution is sim¬ 
ply uniform, and in fact we explicitly sample every point 
once. The f values resulting from this computation are 
stored. At the next iteration during which the same image is 
seen, these (j) values are used in Equation (14) to construct 
a non-uniform importance distribution which is then sam¬ 
pled from. We use a number of samples proportional to the 
effective sample size of the importance distribution, so the 
number of samples used naturally decreases as the impor¬ 
tance distribution becomes more peaked, leading to large 
speedups at late iterations during optimization. 


Algorithm 2 Importance Sampling 
Given fox i ^3 from previous iteration 

for j e 1. .J do 
for z G IT do 
Compute 
end for 
end for 

Qj ^ (1 - + atpj Vj € 1..J 

N ^ sqs 
3^0 

fork G do 

i ^ sample from q 
insert i into 3 

end for 

Use J to compute for next iteration 


In Equation (14), the previous f values are not directly 
used, but rather they are annealed by a temperature param¬ 
eter and then smoothed by a kernel matrix. Both of these 
steps serve to guard against importance distributions which 
are too peaked around large f values, which would inhibit 
the importance sampler from exploring the domain. The 
kernel matrix also serves the purpose of allowing use of f 
values from a previous iteration even if the resolution of 
quadrature points being used has increased at the current 
iteration. The Von Mises-Eisher kernel is used for orienta¬ 
tions and in-plane rotations, while a Gaussian kernel is used 
for shift: 

'Kv{di^dj; Kv) oc ex.p{KvdJdj) 

KG{ti,tj;KG) OC exp(-KG||ti - tjll^) 

where ny and k^g are precision parameters for each ker¬ 
nel which are set based on the resolution of the quadrature 
scheme used at the previous f values, di and dj are the 
quadrature directions (in 5^ for particle orientation and 
for in-plane rotation, and ti and tj are the quadrature shift 
values (in M^). 

The algorithm for constructing an importance distribu¬ 
tion and sampling from it are given in Algorithm (2). The 
sampled values are then used to compute (12). Note that 
some quadrature points can end up being sampled multiple 
times, this is detected and the value reused to reduce com¬ 
putation. 

C. Error Measures 

Because ground-truth is rarely available for Cryo-EM, 
measuring accuracy is often difficult. Traditionally, the field 
has used the Fourier Shell Correlation (ESC) to measure 









the resolution of a solved structure. The so-called gold- where the sum is taken over the test set which has Attest 
standard FSC works by splitting the dataset in half, estimat- images and is the noise variance of the dataset. Values 

ing two densities separately and the computing the normal- near 1 indicate that the data is being well explained, 

ized correlation in Fourier space as a function of frequency. 

This curve would then be thresholded to provide an estimate 
of accuracy. However, we note that this measure is actually 
estimating the variance of the estimator, not the accuracy of 
the density it has produced. Further it is only theoretically 
justifiable when the estimator is unbiased, which is not true 
of the method proposed here or with other likelihood-based 
Bayesian methods such as RELION. 

Instead, we introduce a novel metric based on recon¬ 
struction error of a held test set. To quantify the ability of 
marginal likelihood methods, such as ours, to model and 
explain the observed data we introduce the Expected Mean 
Squared Error 

V) = [\\I - QStPRVlP] (15) 

to be the expectation of the squared error between the im¬ 
age and its reconstruction under the image formation model. 

Note that the expectation is conditioned on the current den¬ 
sity and the CTF parameters and is taken over the unknown 
pose and translation, R and t. After switching to Fourier 
space and with some manipulation V) becomes 

Z-i [ [ \\i-C0StPnVfp{I\O,R,t,V)p{K)p{t)dRdt 

JsO{3) 

(16) 

where the 


Z = [ [ p(X|(9,R,t,V)p(R)p(t)dRdt (17) 

7m2 JsO{3) 

is a normalization constant. Computing this would be com¬ 
putationally expensive, instead we use an importance sam¬ 
pling based approximation, f^(X|6>, V), 
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where 
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is the approximation of the normalization constant. The 
above quantities can be readily computed along with the 
main likelihood computation using the same importance 
sampling scheme described above. 

We compute the average value of f^(X|6>, V) on a held 
out set of test images whose gradients are never used. To 
normalize for different datasets we report the Relative Root 
Expected Mean Squared Error (RREMSE) as 




